Distributed Computing Using Spark
Supervisors: Anas Alzogbi (Geb. 051, Raum 01-028), Victor Anthony Arrascue Ayala (Geb. 051, Raum 01-028)
Tutors: Elira Daja
Time & Place: Kick-off (bring your laptop): Wednesday 17.10.2018, 14h – 18h (c.t.), Building 051, Room 01-029. Other meetings (see Experiments & Schedule for further details): Wednesdays, 14h – 18h (c.t.), approx. every 2 weeks, Building 051, Room 01-029.
Language: Exercise sheets will be written in English. The meetings with the tutor will be in English.
Study: Master
Registration
Please apply via HISinOne (Course Catalog) for this Lab Course, as the number of participants is limited to 6.
Pre-requisites:
Programming in Python (basic knowledge). Attendance in the lecture 'Data Analysis and Query Languages' in the summer term is highly recommended.
Experiment sheets
This course is based on practical experiment sheets that have to be solved individually. The submitted solutions will be marked and discussed with the tutor (attendance is compulsory).
Content
The Web has undergone significant changes over the last decade. Currently, in the so-called Web 2.0 users are able to publish their own content, collaborate, discuss and form online communities.
The continuously growing content brings with it new challenges to the current paradigms for combining data, content, and services from multiple sources.
Therefore, distributed storage and processing are sometimes unavoidable when it comes to creating personalized experiences and applications.
Apache Spark brought a revolution to the big data space.
In fact, it has overtaken Hadoop, an open-source, distributed, Java-based framework, which consists of the Hadoop Distributed File System (HDFS) and MapReduce, its execution engine.
Spark is nowadays the most active open-source big data project. It is similar to Hadoop in that it is a distributed, general-purpose computing platform.
However, by keeping large amounts of data in memory, Spark can execute programs up to 100 times faster than their MapReduce counterparts.
The purpose of this practical is to learn the capabilities of Spark in an incremental fashion. The application to be developed is a Recommender System,
i.e. a system which provides useful suggestions to users.
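To give a flavour of what such a system computes, here is a minimal, Spark-free sketch of user-based similarity, one common building block of recommenders. All user names and ratings below are hypothetical toy data, and plain Python stands in for the distributed implementation built in the lab:

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 3, "C": 5},
    "carol": {"A": 1, "C": 2},
}

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

# Alice's tastes are closer to Bob's than to Carol's, so a
# user-based recommender would weight Bob's ratings more highly
# when suggesting new items to Alice.
sim_ab = cosine(ratings["alice"], ratings["bob"])
sim_ac = cosine(ratings["alice"], ratings["carol"])
```

In the course itself, such computations are expressed over Spark data structures so that they scale to many users and items.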
The student will implement different kinds of recommendation algorithms in a distributed fashion while covering Spark's main capabilities:
MapReduce, batch programming, real-time data-processing functions, SQL-like handling of structured data, graph algorithms, and machine learning.
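As a warm-up for the MapReduce style listed above, the classic word count can be sketched in plain Python. The map, shuffle, and reduce phases below mirror what Spark performs in a distributed way (e.g. via its `flatMap` and `reduceByKey` RDD operations); the input lines are made up for illustration:

```python
from itertools import groupby

# Toy input standing in for a distributed text file.
lines = ["to be or not to be", "to thine own self be true"]

# Map phase: emit a (word, 1) pair for every word in every line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: bring pairs with the same key (word) together.
pairs.sort(key=lambda kv: kv[0])
grouped = groupby(pairs, key=lambda kv: kv[0])

# Reduce phase: sum the counts for each word.
counts = {word: sum(c for _, c in group) for word, group in grouped}
```

The point of Spark is that each phase runs in parallel across a cluster, while the program keeps roughly this three-step shape.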
Material
1. Spark in Action. Petar Zečević and Marko Bonaći.
2. Machine Learning with Spark. Nick Pentreath.
3. Recommender Systems Handbook, 2nd ed. 2015. Francesco Ricci, Lior Rokach, Bracha Shapira.
Note: access within the university's network.
Used technologies
Docker
Jupyter Notebook
Python
Apache Spark
PySpark (Python API for Spark)